Optimal Detection of Sequence Similarity by Local Alignment 
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ABSTRACT 

The statistical properties of local alignment algorithms with 
gaps are analyzed theoretically for uncorrelated and corre- 
lated random sequences. In the vicinity of the log-linear 
phase transition, the statistics of alignment with gaps is 
shown to be characteristically different from that of gap- 
less alignment. The optimal scores obtained for uncorre- 
lated sequences obey certain robust scaling laws. Deviation 
from these scaling laws signals sequence homology, and can 
be used to guide the empirical selection of scoring parame- 
ters for the optimal detection of sequence similarities. This 
can be accomplished in a computationally efficient way by 
using a novel approach focusing on the score profiles. Fur- 
thermore, by assuming a few gross features characterizing 
the statistics of underlying sequence-sequence correlations, 
quantitative criteria are obtained for the choice of optimal 
scoring parameters: Optimal similarity detection is most 
likely to occur in a region close to the log side of the log- 
linear phase transition. 

Keywords: sequence alignment; homology; optimization; 
phase transition 

1 INTRODUCTION 

Sequence alignment is a vital tool in molecular biology. It 
has been used extensively in discovering and understanding 
the functional and evolutionary relationships among genes 
and proteins jl^, |37| . There are two basic types of alignment 
algorithms: algorithms without gaps, e.g., BLAST and 
FASTA Q, and algorithms with gaps, e.g., the Needleman- 
Wunsch algorithm |£7j and the Smith- Waterman algorithm J30| ] . 
Gapless alignment is widely used in large-scale database 
searches because the algorithms are fast 111 , the results de- 
pend only weakly on the choice of scoring systems Q], and 
the statistical significance of the results is well-characterized 
PH p2fl . However, gapless alignment is not sufficient for the 
detection of weak sequence similarities jis| . For the detailed 
analysis of such sequences, algorithms with gaps are nec- 
essary jT|, H II- Advancing our understanding of the 
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statistics of gapped alignment could therefore be critical to 
the wider usage of these more powerful tools. 

A notorious difficulty for any alignment is the selection 
of scoring schemes and/or parameters: In a generic sequence 
matching problem, a score is assigned to each alignment of 
given sequences, based on the total number of matches, mis- 
matches, gaps, etc. Maximization of this score defines an 
optimal alignment. However, it is well known that the op- 
timal alignment of given sequences strongly depends on the 
particular scoring scheme and/or parameters used. Con- 
sequently, the fidelity of an alignment, i.e., the extent to 
which the alignment captures mutual correlations among 
the aligned sequences, can depend strongly on the choice 
of scoring parameters. Understanding the influence of these 
parameters on the resulting alignment and choosing the ap- 
propriate parameters are therefore important for the proper 
application of these algorithms. This requires the knowl- 
edge of the statistics of alignment results, which has been 
obtained only for gapless alignments [^J [l|, |5l], |2^] . For align- 
ments with gaps, appropriate parameters have so far been 
chosen mostly by trial and error, although there have been 
systematic efforts to establish a more solid empirical foot- 
ingiil. 

Recently fllq, [111 we have analyzed the statistical prop- 
erties of global alignment with gaps. Such algorithms align 
sequences of similar lengths in their entirety. By exploit- 
ing mathematical analogies to certain well-studied problems 
of statistical mechanics [ [t9"| , |l6| , E^ , p3| , we have obtained 
a quantitative description of the global alignment statis- 
tics for mutually uncorrelated as well as correlated sequence 
pairs. Here we extend the analysis to local alignment algo- 
rithms 30) which find the best match between contiguous 
subsequences, subject to (finite) penalties for gaps and mis- 
matches. For uncorrelated random sequences, i.e., for in- 
dependent sequences with iid or Markov letters, it is well 
known that depending on the choice of scoring parame- 
ters, the length of the optimal subsequence alignment de- 
pends either linearly or logarithmically on the total length 
of the sequences Jji| 0|. A phase transition line separates 
the space of scoring parameters into two regimes: the "lin- 
ear phase" for small gap and mismatch costs, and the "log 
phase" for large penalty costs. It is clear that local align- 
ment deep in the linear phase is equivalent to global align- 
ment (and hence described by the results of our previous 
studies). On the other hand, the log phase at high gap 
penalty becomes indistinguishable from the log phase of gap- 
less alignment. Indeed there have been extensive empirical 
efforts |3l],0, ^9| 40,^| to characterize the statistics of the 
log phase of gapped algorithms by an effective description 
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as gapless alignment with modified parameters. While this 
approach is reasonable far away from the phase transition, 
it becomes questionable as the phase transition line is ap- 
proached, since the linear phase itself is completely different 
from the log phase. On the other hand, the loci of scoring 
parameters for optimal similarity detection appear to lie in 
the log phase close to the phase transition line, according 
to recent empirical studies by Vingron and Waterman [ 32|. 
Hence, understanding the log-linear phase transition is cru- 
cial for optimizing the detection of sequence similarity and 
quantifying its statistical significance. 

In this work, we apply the well-established theory of 
phase transition Q to the log-linear transition of gapped 
local alignment. We find various statistical properties at 
and in the vicinity of the transition to be governed by scal- 
ing laws analogous to those recently discovered for global 
alignment [Ll[. The transition turns out to differ qual- 
itatively and quantitatively from its counterpart in gapless 
alignments. Our results lead to quantitative criteria for the 
optimal choice of scoring parameters, given certain gross 
statistical characteristics of the expected sequence correla- 
tion. In particular, they explain why optimal parameters 
for weakly correlated sequence pairs are in the vicinity of 
the phase transition line as observed by Vingron and Wa- 
terman 63]. Also emerging from this work is a versatile 
method to detect sequence correlation and characterize its 
statistical significance empirically for sequences with a pri- 
ori unknown correlations. 



2 REVIEW OF ALIGNMENT ALGORITHMS 

We study the Smith- Waterman (SW) local alignment algo- 
rithm applied to a pair of long nucleotide sequences V\ and 
P 2 , with lengths Ni ~ N 2 > 1. Let P n>i € {A,T,G,C} 
be the i th element of the sequence V n . A particular align- 
ment consists of an ordered set of pairings of two elements 
(Pi,i, P2,j), or of an element with a gap, for any contiguous 
subsequence of length l\ < Ni in sequence Vi and length 
£2 < N2 in sequence V2 [see Fig. 1]. The simplest scoring 
system assigns a positive score of +1 if the two elements 
paired together are identical, and a negative score of — fj, if 
the two are different. Each pairing of an element with a gap 
is penalized with a negative score —8. (In the simple case 
considered here, we shall not distinguish between gap initia- 
tion and gap extension. It is then sufficient to consider only 
the region 28 > (J^.) The sum of the scores of all individ- 
ual pairings of a given alignment is the total score for that 
alignment. An optimal alignment is one for which the to- 
tal score is maximized for a given set of scoring parameters 
(/x, 5). The SW algorithm uses the dynamic programming 
method to find the optimal alignment of all possible subse- 
quences. Key to the algorithm is the uniqu e re presentation 
of an alignment by a directed path (see Ref. |2j] and Fig. 1). 
Let us briefly recall the algorithm below, cast in a slightly 
different notation to facilitate the subsequent analysis. 

Consider the alignment lattice shown in Fig. 1. We label 
each lattice point by the coordinate (x,z), with the lower 
tip of the lattice anchored at (0,0). The highest total score 
of all alignment paths ending at a point (x, z) is denoted by 
h(x,z). Given h(x,z) and h(x,z — 1) for all x, h(x,z + 1) 

1 For 28 < fi , it is always favorable to replace a mismatch by two 
gaps, so the outcome of an alignment becomes independent of [i. 
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Figure 1: One possible local alignment of the two nucleotide 
sequences, Ti = GGACATA... and V2 = CGTATAG... 
The aligned subsequences are shown in boldface, with 4 
pairings (three matches, one mismatch) and one gap. This 
alignment can be represented uniquely as a directed path 
on the alignment lattice; each left (right) turn of the path 
correspond to a gap insertion in sequence V\ {P2)- 



can be computed from the recursion relation 



h(x, z + 1) = max ■ 



h(x + 1, z) — S 
h{x — 1, z) — 5 
h(x, z — 1) + u(x, z) 
h 



(1) 



with 8 being the gap insertion cost, u being the match/mismatch 
score to be specified below, and ho being a cutoff score. The 
SW algorithm has ho = 0, which effectively deletes the seg- 
ment of the alignment path connecting (0, 0) and (x, z) if 
h(x,z) < 0. In contrast, the global alignment algorithm 
of Needleman and Wunsch has no cutoff, corresponding to 
the limit ho — > —00. Also, gapless local alignment (with 
8 — > 00) corresponds to the limit of the recursion relation 
(lit) involving only one value of x, say x = 0. The scoring 
function u(x, z) gives the match/mismatch score of aligning 
the elements Pi i ; with P2 , j , with x — i — j and z = i + j — 1. 
For simplicity, we use in this study the form 



u(x 



-j, z = 



if Pi,i = P2,j 
if P14 + Ph,j 



(2) 



More elaborate forms of the scoring function are easily in- 
corporated and do not change key results of this study. 

If z and x are regarded as "time" and "space" variables, 
respectively, the recursion relation (|l|) can be viewed as a 
"dynamical process" describing the time evolution of the 
one- dimensional "score profile" h(x,z). (Similarly, gapless 
local alignment involving only the site x — corresponds 
to the "zero-dimensional" limit.) This dynamic analogy will 
be pivotal in guiding the ensuing analysis. 



3 GLOBAL ALIGNMENT 
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3.1 Statistics of Uncorrelated Sequences: Universal Scal- 
ing Laws 

Let us first review some of the relevant results we previ- 
ously obtained for the global alignment of random sequence 
pairs In Fig. 2(a), we show several representative constant- 
z slices of the score profile h(x) obtained by iterating Eq. ([[]) 
with (fJ,,S) — (0.5,2.0). The alignment algorithm is applied 
to one pair of random sequences each of length N = 10000. 
Results are shown for a central rectangular region^ of the 
alignment lattice, -X/2 < x < X/2 and X/2 < z < 
2N — X/2 with X < N, starting from the initial condition 
h(x, z — X/2) = 0. It will be convenient to use a shifted 
time-like variable, t = z — X/2. In Fig. 2(a), we see a series 
of disorderly score profiles, with the "spatial" average 



h(t) = X' 



X/2 

E 

c=-X/2 



(3) 



advancing steadily in t. For large t, we obtain the linear 
dependence, h(t) = vo(fJ,,S)t (not shown). The value of the 
rate vq itself is not important for global alignment. (Thus, 
h{t) could as well be decreasing linearly in t.) More sig- 
nificant is the spatial variation in the profile which always 
increases for increasing t. This is more clearly seen by plot- 
ting the effective width of the profile, w(t), defined as 



X/2 

E 

= -X/2 



h{t)Y 



(4) 



or alternatively, the difference between the highest and low- 
est point of the profile, Ah(t) — /i ma x(i) — ft-min(t); see 
Fig. 2(b). 

The roughness of the profile, as quantified by either w(t) 
or Ah(t), is an important characteristic of the alignment, 
since it indicates how strongly the optimal alignment dom- 
inates over the suboptimal alignments. The score profiles 
of Fig. 2(a) show the weak dominance of the optimal align- 
ment and the existence of a large number of suboptimal 
alignments. The statistics of these suboptimal alignments 
has been recognized recently as a valuable tool in sequence 
alignment; for an interesting recent exposition, see Ref. [^3]| . 
From Fig. 2(b), it appears that the roughness grows with a 
sub- linear power in t. This is verified in Fig. 3(a), where we 
show the ensemble average w(t) for different sets of scoring 
parameters (/x, S), each curve averaged over 1000 pairs of 
uncorrelated random sequences. (Throughout the text, we 
use overbars to denote averages of an ensemble of random 
sequence pairs.) It is seen that for large t, the width obeys 
the asymptotic scaling law 



w(t) = B{fi,S)t u 



(5) 



Different parameter choices only affect the prefactor £?, but 
not the exponent uj w 1/3. The same scaling law (with a 
larger coefficient) is found also for Ah(t). 

The scaling law (ph is an example of the "universal scal- 
ing phenomena" well studied in statistical mechanics 

2 Due to boundary effects arising from the diamond-shaped align- 
ment lattice (Fig. 1) , the total score h[x, t) certainly decreases 
(quadratically on average) as one moves far away from the center at 
x — 0. To remove these spurious effects, we focus our attention only 
to the central strip of the alignment lattice, e.g., for —X/2 < x < X/2 
and X/2 < z < 2N - X/2., with X < N. All results reported here are 
obtained using this strip geometry. For long sequences, the statistics 
of the score profile obtained from the strip is indistinguishable from 
that obtained with the full alignment lattice. 
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Figure 2: (a) A typical series of score profiles h(x,t) ob- 
tained from the global alignment of a pair of uncorrelated 
random sequences; parameters used are (/i, S) = (0.5,2.0). 
(b) Gradual increase in the "roughness" of the profile, as 
characterized by either the width w(t) or the range Ah(t) 
over the range of x shown in (a). The straight lines indicate 
the suggested power law dependence. 



The exponent u> is a very robust characteristics of random 
sequence alignment. It quantifies the roughness of the profile 
and hence the dominance of the optimal alignment. It does 
not depend on details of the scoring function, (e.g., whether 
or not gap initiation and extension are differentiated) but 
only on a few qualitative characteristics such as the number 
of sequences aligned or the type of the correlations between 
them. 

In Ref. UM , we have given arguments suggesting that the 
result uj = 1/3 is exact for the global alignment of random 
sequence pairs. Our approach is based on the close analogy 
of the recursion relation (jl]) and discrete models of kinetic 
surface growth studied extensively in statistical mechanics in 
the past decade^]. In these growth models, h(x, t) describes 
the height profile of a one-dimensional surface at time t. 
The growth is driven by a stochastic process that governs 
the deposition and removal of material on the surface and 
is described by the random variable rj(x,t) — ^u(x,t) — S. 
The large scale behavior of the profile h(x, t) generated by 
the growth model is well described by the differential growth 

3 General reviews of the kinetic growth problem and its rclatie« 
to thp-problem of first passage percolation can be found in Rcfs. [G4| 
and 
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Figure 3: (a) The ensemble averaged roughness (width) 
for different sets of scoring parameters, each averaged over 
1000 pairs of randomly generated sequences of length N = 
10, 000. The dashed straight line indicates the expected 
power law form, (b) The ensemble averaged spatial cor- 
relation function C(x,t) for ((i, 5) = (0.5,2.0). The dashed 
line indicates the expected scaling behavior at large t. For 
small t's, C(x,t) saturates to the order of w 2 (t). 



equation 



dh d 2 h A fdh\ 2 , , 



(6) 



with the coefficients v and A being functions of 8 and (i. 
Eq. (|6]) is known as the Kardar-Parisi-Zhang equation, closely 
related to the noise-driven Burgers' equation flifl . If r](x,t) 
is an uncorrelated Gaussian random variable, the stationary 
state (t — » oo) of the surface can be obtained exactly [E4|, 
resulting in the Gaussian equal-time distribution 



P{h(x,t — > oo)} oc e 



■J2Jh(x + l,t)-h(x,t)] 2 



(7) 



and a coefficient D((j,,S). Approaching the steady state, 
one has w(t > 1) = Bf, with B ps D 2/3 and u = 1/3 
exactly. For sequence alignment, the variables r/(x, t) and 
ri(x',t') at different points are not independent given their 
definition above. One can verify for example that the higher 
moments of rj are long-range correlated. However, as argued 
in Ref. [[l8| and [Hwa and Lassig (to be published)], these 
correlations do not affect the asymptotic scaling behaviorfj. 



More detailed discussions arc given in Ref. 
number of closely related physics problem. 



in the context of a 



The close correspondence of the numerical results on w(t) 
(Fig. 3(a)) and the exact result u) = 1/3 supports this con- 
clusion. An independent check is to measure directly the 
equal-time correlation of h, C(x, t) — [h(x + x' , t) — h(x', t)] 2 . 
From Eq. (^), we expect to have C(x,t) = D\x\ for suffi- 
ciently large t (or sufficiently small x). For finite t, C(x,t) 

must eventually saturate, to a value ~ w 2 (t) for x > O (t 2,/3 ) . 

The numerically obtained forms of C(x, t) are shown in Fig. 3(b) 
for (fi,6) — (0.5,2.0). The results are in good agreement 
with the anticipated form. Similar results are obtained for 
other values of the scoring parameters. 

3.2 Correlated Sequences: Similarity Detection from the 
Score Profile 

To illustrate how the above knowledge can be used for the 
purpose of similarity detection, we next describe the score 
profile for the global alignment of correlated sequences. Se- 
quence correlations are obtained by first generating a ran- 
dom "ancestor sequence" , and then making imperfect repli- 
cas V\ and Vi. The degree of sequence-sequence corre- 
lation is quantified by the average number of point sub- 
stitutions per base p s and the average number of indels 
per base p t made in the replication of each daughter se- 
quence. (See Refs. |is| , pH for details.) In Fig. 4(a), we 
show the score profile~7i(a;, t) and the range Ah(t) for a pair 
of heavily mutated daughter sequences with p a = 40% and 
pt = 20%, corresponding to a fraction of < 30% conserved 
elements. The score profiles obtained are very different from 
the profiles characteristic of uncorrelated random sequences 
shown in Fig. 2(a). For correlated sequences (Fig. 4(a)), a 
peak emerges from the disordered background after some 
time. The height of the peak advances steadily at a rate v 
which exceeds the growth rate of the background vq. Thus, 
the peak gradually broadens to engulf the entire profile. 
The dominance of the central peak reflects the existence 
of strongly preferred alignments for correlated sequences, in 
marked contrast to the alignment of random sequences^. 

The existence of the peak can be taken as a manifesta- 
tion of inter-sequence correlations. The strength of sequence 
correlation detected is quantified by the difference between 
the growth rate of the peak and the background, e = v — vo- 
The magnitude of e is clearly dependent on the mutation 
rates p a and pt, but also depends on the scoring parameters 
(i and S. The functional form of e((i, 5;p a ,pt) has recently 
been investigated in detail |ll| and will not be addressed 
here. 

A peak in the score profile is discernible from the back- 
ground only if the height of the peak, of the order e-t after t 
steps, exceeds the roughness of the background ~ B-t 1 ^ 3 (see 
Fig. 4(b)). Hence, using the score profile approach, one can 
detect correlations between sequences if their lengths exceed 
the threshold length 



t c ((i,5;p s ,pt) ~ [B((i,S)/£(n,8-p s ,p t )} 3/2 . 



(8) 



Minimization of t c (for a given sequence pair) is a natural 
empirical criterion for optimal similarity detection, yield- 
ing a preferred choice of scoring parameters 8* (fi;p s ,pt). A 

5 In statistical mechanics, one considers the free energy, which is 
the negative of the score. The score profile discussed here is a measure 
of the "energy landscape". A peak in the score profile corresponds to 
a valley or a "funnel" in the energy landscape. It has been suggested 
that the energy landscape je£ hctcropolymcric systems such as a pro- 
tein has the funnel shape [Eqj. The score profile (Fig. 4(a)) obtained 
here is the first known example of a large hctcropolymcric system for 
which the suggested landscape is directly observed. 
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Figure 4: (a) The score profiles obtained from the global 
alignment of a pair of weakly correlated sequences (with 
conserved fraction < 30%) are characteristically different in 
appearance from those of the uncorrelated sequences shown 
in Fig. 2(a). (b) The roughness corresponding to the profiles 
of (a) begins to deviate from the t 1 ^ 3 behavior at t ~ 2000. 



more fundamental criterion for choosing the optimal scor- 
ing parameters is to maximize the fidelity of the alignment, 
i.e., the extent to which the optimal alignment reconstructs 
the ancestor sequence Eq|. The dependence of fidelity on 
scoring parameters has Deen studied in detail in Ref. 
It was found that maximizing the fidelity is closely related 
(and equivalent for practical purposes) to minimizing the 
threshold length t c . Thus, the empirical criterion based on 
the score profile is indeed a versatile way of locating the 
optimal parameters. 

The above strategy of similarity detection is close in 
spirit to the well-known "shuffling method", which com- 
pares the alignment score of a given sequence pair to the 
distribution of scores obtained from aligning the ensemble 
of randomly shuffled sequences (for the same set of scor- 
ing parameters). However, by making use of the spatially- 
extended score profile and its time evolution (as opposed to 
keeping track of only the optimal score), we have demon- 
strated that this comparison can be accomplished by one 
single alignment. This is a drastic improvement over the 
shuffling method, which requires the generation and align- 
ment of an ensemble of sequences. It should thus be possi- 
ble to minimize t c (jM,S) empirically for sequence pairs with 
a priori unknown correlation, since the value of t c can be 



obtained directly from the onset of score peak shown in 
Fig. 4(b), without any assumption on the nature of sequence 
correlations. It would be particularly interesting to combine 
this approach with the efficient ensemble and parametric 
alignment algorithms |3(| |l4|, Q which find optimal scores 
for all parameters. 

4 LOCAL ALIGNMENT WITH GAPS 

We now describe the statistical properties of local align- 
ment, with ho = in the recursion relation (|l|). Local 
alignment is necessary since often only a subsequence of one 
sequence is correlated with a subsequence of another. Let 
the lengths of the correlated subsequences be l\ ~ I2 ~ £. 
If the positions of the subsequences were known, the cor- 
relations would be detectable by global alignment of these 
subsequences if el > BE 1 / 3 , i.e., if I > t c . However, if global 
alignment is applied to the entire sequences, the background 
noise BN 1/I3 will in general overwhelm the score signal of the 
correlations el. The advantage of local alignment is that by 
cutting off the length of the aligned segment, it limits the 
background roughness to a finite value such that the corre- 
lation peak can still be detected. 

4.1 Uncorrelated Sequences: the Log-Linear Phase Tran- 
sition 

We first discuss the background, i.e., the local alignment of 
uncorrelated random sequence pairs. Unlike global align- 
ment, the outcome of local alignment depends critically on 
the value of vo, i.e., the growth rate of the average score 
of the corresponding global alignment problem. For scor- 
ing parameters (p., S) in the regime where vo(n,5) > 0, the 
score h(x,t) grows without bound also for local alignment, 
and the existence of a cutoff score ho = is immaterial in 
the limit of large t. Thus the asymptotic properties of local 
and global alignments are identical in this regime. (It ac- 
quires the name "linear phase" since the average score h(t) 
advances linearly by definition.) For Vo(p,5) < 0, on the 
other hand, the average h(t) saturates quickly to a constant 
value, and the largest value h(x, t' < t) scales as log(f) due 
to exponentially rare events. As has been pointed out by 
Arratia and Waterman [unpublished], the condition^] 



v (n, S) = 



O) 



defines the phase transition line separating the two regimes 
(see Fig. 5). The statistical properties in the vicinity of this 
phase transition have recently been studied in the context 
of some related physics problems j26| . We will briefly sum- 
marize the results below. 

The phase transition is very similar to that of gapless 
local alignment, except that the £ 1//2 score dependence at 
the transition is replaced by 



h c (t) = hit),. 



■ b(S) t 



1/3 



(10) 



along the critical line p, c (5) for large t, with a coefficient 
b(S) ~ B(n c (8),8) dependent on the location along the 
phase transition line. Slightly away from the critical line_on 
the log side, i.e., for p = fj, — fj, c > where vo oc — ft, < 0, h(t) 

6 Howcvcr, finite-size effects [^| may cause a slight shift in the ap- 
parent transition point. For example, the fits of Figs. 6 and 7 found 
[i c ~ 0.7085 for 8 — 2.0, whereas vq — occurred at /i S3 0.7040 for 
S = 2.0. 
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Figure 5: The critical line fi c (8) separating the log and the 
linear phases is obtained from the condition vo(fi c ,5) = 0. 
(The critical line becomes independent of fi below the line 
25 = (j,.) 



s indistinguishable from h c (t) for small t, saturating eventu- 



1/2 



ally to a constant value hunt ~ 6 3//2 /|uo 
scale i x is obtained from the condition h c (t 
yielding 

tx ~ (VN) 3/a « \V>\ 



The saturation 



-3/2 



(11) 

This is the length of the the optimally aligned subsequences 
selected by local alignment. For sequence lengths N much 
longer than t x , it is reasonable to approximate the score 
statistics by that of aavless local alignment, as has been 
attempted previously (5IJ [?], ^jjl- However, because 

t x diverges as the critical point (pt = Oj is approached, the 
gapless approximation becomes invalid for t x > N, or for 
Jj, < N~ 2 ^ 3 . On the other side of the critical line where 
"p, < and wo > 0, h(t) again equals h c (t) for t <t x , before 
changing to the linear dependence for larger values of t. The 
statistics of global alignment applies to the linear phase on 
scales t > t x . _ 

The above behavior of h(t) can be summarized by the 
homogeneous scaling relation 



h(t;-fi) = \ji\- 1 ' 2 f ± (tW) 



;|3/2s 



(12) 



with the two branches of the scaling functions /± (x) having 
the limiting forms f±(x -C 1) = bx 1 ^ 3 , f+(x 2> 1) — > const 
for Jl > and f-(x S> 1) oc x for Jl < Fj Such scaling 
relations are widely encountered in physicafsystems under- 
going continuous phase transitions and have been studied 
extensively by the modern theory of critical phenomena || . 

In Fig. 6(a), we show the score average h(t) for local 
alignment of 1000 random sequence pairs of length 10000, 
for 5 — 2 and various /i's taken ±5% around t he critical 
value fjL c £s 0.7085. The anticipated scaling form (^|) can be 
checked by multiplying the horizontal axis of Fig. 6(a) by a 
factor |^| 3 ^ 2 and the vertical axis by |pt| 1//2 individually for 
each curve. The result is shown in Fig. 6(b). The 11 curves 
displayed in Fig. 6(a) (each containing 10 data points) col- 
lapse into two branches, corresponding to the two branches 
of the scaling function f±. The upper branch (/_) describes 



It should be noted that the scaling form (121) also describes 
the score /i ga pless(i; °f the zero-dimensional problem of gapless lo- 
cal alignment in the vicinity of its phase transition, but with mod- 
ified exponents. The results h gap i eBE (t; Jl) — \J1\ ~ 1 g± (t \J1\ 2 ), with 
g± {x *C 1) oc :c 1//2 , g+ {x ^> 1) — > ciewst and g-(x > 1) cc a; can be 
straightforwardly verified; see Ref. [Eq] for a detailed discussion. 
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Figure 6: (a) The ensemble averaged score h(t) obtained 
from the local alignment of random sequence pairs, with 
5 — 2.0, and various values of /i slightly above and below 
the critical value /u c (<5 = 2.0) w 0.7085. (The order of the 
curves corresponds to the order shown in the legend, with 
the top curve having the smallest value of /i and the bottom 
curve having the largest value of fi.) Each curve is averaged 
over 1000 pairs of sequences of length 10, 000 each, (b) The 
curves in (a) plotted according to the homogeneous scaling 
form (see text). The dashed lines indicate the anticipated 
power law forms in the two different regimes. 



the crossover of the critical behavior h c {t) to the linear be- 
havior, and the lower branch (/+) describes the crossover 
of the critical behavior towards saturation. The only fit- 
ting parameter for this data collapse is the location of the 
transition point ji c . 

A similar scaling relation exists for the roughness of the 
score profile. In Fig. 7(a), the width w(t) is plotted for 
various values of fj,, approaching the phase transition from 
the log side. The data can be collapsed by the same trans- 
formation as in Fig. 6(b), with the same fitting parameter 
fi c . The results (Fig. 7(b)) show clearly that w(t;JT) has the 
same form as h(t;Jl) in the log phase, i.e., the lower branch 
of Fig. 6(b). In particular, w(t > t x ) saturates to a constant 
value w sat of the same order as /i sat , yielding 



bt 



1/3 



(13) 



This is just the expression (g) describing the score roughness 
of global alignment, evaluated at t — t x . It is a manifestation 
of the general result that local alignment corresponds to 
global alignment applied to the selected subsequences. 
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Figure 7: (a) The ensemble averaged roughness w(t) in the 
log phase, for S = 2.0 and fi = 0.709, 0.710, 0.712, 0.715, 
0.720, 0.735, 0.765, 0.825, 0.950 (from top to bottom). The 
sequences are the same as those used in Figs. 6. (b) The 
roughness curves plotted according to the scaling form. 



and (pj|), we find a ~ (e — |wo|) ■ |uo| 1//2 , which is maximized 
at 

«o = v (fJ,*,5*) = --, (16) 

with 

a* oc e 3/2 ~ tj 1 . (17) 

Eqs. (no) and (n/7j) are the central results of this study. 
Eq. (n6f) shows thatfor weakly correlated subsequences (i.e., 
e — ► uJ7 the optimal scoring parameters should be close to 
the log side of the phase boundary. This is a quantitative 
formulation of the empirical observation of Vingron and Wa- 
terman |32]| . In addition to the choice of Vo, Eq. (^) shows 
that t c should still be minimized as discussed in Sec. 3.2 for 
global alignment. These two conditions uniquely determine 
the optimal parameters fi* and 6*. Using the optimal result, 
the condition (Q is reduced to t—to > t* = t c (fi* ,S*;p s ,pt), 
which yields the minimal subsequence length £ for which cor- 
relations can be detected. 

5 SUMMARY 

In this study, we presented a statistical description of the 
Smith-Waterman local alignment algorithm, focusing on the 
properties near the log-linear phase transition line. We 
demonstrated how this knowledge can be exploited to pro- 
vide quantitative criteria guiding the choices of alignment 
parameters for optimal detection of weak sequence correla- 
tions. The optimal values and S* are obtained in terms 
of the substitution rate p s and the indel rate p t characteriz- 
ing the statistics of sequence correlations. By analyzing the 
evolution of the spatially-extended score profile, we are able 
to detect sequence correlations by a single run of the algo- 
rithm. This is a very efficient way to optimize the scoring 
parameters empirically, and may be useful in the alignment 
of a vast number of weakly correlated sequences. 



4.2 Correlated Subsequences: Similarity Detection and 
Parameter Dependence 

The existence of a phase transition in local alignment can 
be used to detect sequence-sequence correlations: Consider 
a pair of sequences Vi and Vi with mutually correlated sub- 
sequences located in the intervals from io to io + 1/2 and 
from jo to jo + £/2, respectively. For concreteness, let the 
similarity of these subsequences be generated by the evo- 
lution mechanism described in Sec. 3.2, with the strength 
of inter-sequence correlations characterized by e(fj,, 5;p 3 ,pt). 
Our strategy for similarity detection is simply to choose the 
scoring parameters S and fj, such that vo(fJ,,S) < 0, i.e., the 
log phase is obtained if the sequences are uncorrelated, while 
keeping vo + £ > 0, so that the linear phase may be ob- 
tained instead for some duration of the correlated subse- 
quences, to < t < I, With this parameter choice, the profile 
h(x, i) will have a constant background roughness w aa ,t for 
t < to = io +jo — 1, and a score peak signaling subsequence 
correlations in the interval to < t < £, once 

(e + v ) ■ (t-t ) > UW- (14) 

Optimal similarity detection is obtained by maximizing 
the peak-to-background ratio, 

a — (e + Wo)/w S at (15) 

which can be taken as a measure of the statistical signifi- 
cance of the correlations detected. Using the relations ( |l3| ) 
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